Data features

Name: whether the animal has a name or not

AnimalType: dog or cat

SexuponOutcome: use all values

AgeuponOutcome: ages of 13 years or more are collapsed into one bucket; use all values

Breed: keep only breeds that can be clearly identified; everything else is grouped together

Color: keep only colors that can be clearly identified; everything else is grouped together

DateTime: only the hour is useful (year, month, and day show no pattern)

OutcomeType: the target (y)

OutcomeSubtype: not needed

AnimalID: not needed

There are not many 'Died' samples among the OutcomeType values.

Let's build one model that also predicts Died, and a separate model that excludes Died.


In [85]:
df = pd.read_csv('train.csv')
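
To back the note above that 'Died' is rare, a quick look at the class balance right after loading (a hedged check):

print(df.OutcomeType.value_counts())   # 'Died' should be far rarer than the other outcomes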

In [86]:
# Name encoding: 1 if the animal has a name, 0 otherwise
df.Name = df.Name.fillna(0)
df.loc[df.Name != 0, 'Name'] = 1

In [87]:
# AnimalType encoding: 1 for Dog, 0 for Cat
df.AnimalType = df.AnimalType.apply(lambda x: 1 if x=='Dog' else 0)

In [88]:
def check_over13years(x):
    if len(x)>7:
        if x[-5:] == 'years' and int(x[:-5])>=13:
            return 'over 13 years'
    
    return x
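
A quick sanity check of what this function does (hedged example):

print(check_over13years('15 years'))   # -> 'over 13 years'
print(check_over13years('13 years'))   # -> 'over 13 years'
print(check_over13years('3 months'))   # -> '3 months' (unchanged)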

In [89]:
# AgeuponOutcome: collapse ages of 13 years or more into a single 'over 13 years' bucket.
# Missing ages behave very much like '0 years', so fill them with that value.
df.AgeuponOutcome = df.AgeuponOutcome.fillna('0 years')
df.AgeuponOutcome = df.AgeuponOutcome.apply(check_over13years)

In [90]:
# Drop columns we don't need
df = df.drop(['AnimalID', 'OutcomeSubtype'], axis=1)

In [91]:
# Breed filtering: use only the breeds below
breeds = ['Labrador Retriever', 
'German Shepherd', 
'Golden Retriever', 
'Beagle', 
'Bulldog', 
'Yorkshire Terrier', 
'Boxer', 
'Poodle', 
'Rottweiler', 
'Siberian Husky', 
'Maltese', 
'Persian',
'Maine Coon',
'Siamese',
'American Shorthair', 
'Swedish Vallhund',
'Finnish', 
'Catahoula', 
'Ridgeback', 
'Carolina', 
'Manx',
'Domestic Shorthair',
'Pit Bull',
'Chihuahua',
'Domestic Medium Hair',
'Domestic Longhair',
'Dachshund',
'Rat Terrier',
'Miniature Schnauzer',
'Cairn Terrier',
'Shih Tzu']

In [92]:
def check_in_breeds(x):
    for breed in breeds:
        if x.count(breed) > 0:
            return breed
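
For example (substring matching, so 'Mix' variants still map to the base breed; anything not in the list returns None):

print(check_in_breeds('Domestic Shorthair Mix'))   # -> 'Domestic Shorthair'
print(check_in_breeds('Great Dane'))               # -> None (not in the list above)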

In [93]:
# Keep the breed if it is in the list above; otherwise it becomes None
df.Breed = df.Breed.apply(check_in_breeds)

In [94]:
# Colors: keep only those that appear at least 50 times; the rest will be set to None below
list_color = list(df.Color)
list_color_over50 = []
for color in set(list_color):
    if list_color.count(color) >= 50:
        list_color_over50.append(color)

In [95]:
print('number of distinct colors ->', len(set(list_color)))
print('colors with at least 50 samples -> ', len(list_color_over50))


number of distinct colors -> 366
colors with at least 50 samples ->  60
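
The same filter can be computed more directly with value_counts; a hedged alternative sketch (the loop above rescans list_color once per color):

color_counts = df.Color.value_counts()
list_color_over50_alt = list(color_counts[color_counts >= 50].index)   # hypothetical name, same contents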

In [96]:
def check_in_colors(x):
    if x in list_color_over50:
        return x

In [97]:
# Keep the color if it has at least 50 samples; otherwise it becomes None
df.Color = df.Color.apply(check_in_colors)

In [98]:
# Extract the hour from DateTime
df['hour'] = df.DateTime.apply(lambda x:x[11:13])

In [99]:
# Group sparse hours: 03 and 05-07 -> '5_8', 20-22 -> '20_22', 23 and 00 -> '23_0'; keep the rest as-is
def check_hour(x):
    if x in ['03', '05', '06', '07']:
        return '5_8'
    elif x in ['20', '21', '22']:
        return '20_22'
    elif x in ['23', '00']:
        return '23_0'
    else:
        return x

In [100]:
df.hour = df.hour.apply(check_hour)

In [101]:
df.hour.unique()


Out[101]:
array(['18', '12', '19', '13', '17', '5_8', '15', '14', '11', '16', '23_0',
       '09', '10', '08', '20_22'], dtype=object)

Prediction

Use Logistic Regression, Random Forest, SVM, and so on.

Model that includes Died


In [102]:
X = df.drop('OutcomeType', axis=1)

In [103]:
y = df.OutcomeType

In [104]:
X_dummy = pd.get_dummies(X.ix[:, ['SexuponOutcome', 'AgeuponOutcome', 'Breed', 'Color', 'hour']])

In [105]:
X_dummy['Name'] = X.Name
X_dummy['AnimalType'] = X.AnimalType
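
Note that pd.get_dummies skips NaN/None by default, so the breeds and colors that were mapped to None above end up with all-zero dummy columns, which acts as an implicit 'other' bucket. A minimal illustration on a toy frame (names hypothetical):

toy = pd.DataFrame({'Breed': ['Beagle', None, 'Poodle']})
print(pd.get_dummies(toy))   # the None row gets 0 in both Breed_Beagle and Breed_Poodle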

In [106]:
from sklearn.cross_validation import train_test_split

In [107]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, test_size=0.20, random_state=42)

In [108]:
from sklearn.ensemble import RandomForestClassifier

RandomForest


In [109]:
# using RandomForest
model_rf = RandomForestClassifier(n_estimators=30)
result_rf = model_rf.fit(X_train, y_train)
result_rf.score(X_test, y_test)


Out[109]:
0.63224841002618781

In [110]:
df_importance = pd.DataFrame(zip(X_dummy.columns, model_rf.feature_importances_), columns=['colname', 'importance'])
df_importance.sort_values('importance', ascending=False)


Out[110]:
colname importance
148 Name 0.057559
0 SexuponOutcome_Intact Female 0.036039
17 AgeuponOutcome_2 months 0.034948
2 SexuponOutcome_Neutered Male 0.034786
1 SexuponOutcome_Intact Male 0.031827
3 SexuponOutcome_Spayed Female 0.030567
77 Color_Black/White 0.021285
149 AnimalType 0.020174
19 AgeuponOutcome_2 years 0.018456
142 hour_17 0.017932
73 Color_Black 0.016367
53 Breed_Domestic Shorthair 0.016320
143 hour_18 0.015980
49 Breed_Chihuahua 0.015542
10 AgeuponOutcome_1 year 0.015454
57 Breed_Labrador Retriever 0.014939
141 hour_16 0.014232
137 hour_12 0.014019
140 hour_15 0.013881
139 hour_14 0.013808
138 hour_13 0.012561
63 Breed_Pit Bull 0.012241
136 hour_11 0.011129
4 SexuponOutcome_Unknown 0.010868
21 AgeuponOutcome_3 months 0.010414
23 AgeuponOutcome_3 years 0.010218
146 hour_23_0 0.010139
90 Color_Brown/White 0.010057
116 Color_Tan/White 0.009657
121 Color_White 0.009642
... ... ...
112 Color_Sable/White 0.001404
94 Color_Chocolate/Tan 0.001343
105 Color_Lynx Point 0.001334
113 Color_Seal Point 0.001252
89 Color_Brown/Tan 0.001206
8 AgeuponOutcome_1 week 0.001152
127 Color_White/Gray 0.001140
98 Color_Cream Tabby/White 0.001090
101 Color_Flame Point 0.001083
131 Color_White/Tricolor 0.000988
145 hour_20_22 0.000936
102 Color_Gold 0.000919
16 AgeuponOutcome_2 days 0.000834
128 Color_White/Orange Tabby 0.000806
66 Breed_Ridgeback 0.000717
118 Color_Torbie/White 0.000714
20 AgeuponOutcome_3 days 0.000686
47 Breed_Carolina 0.000633
28 AgeuponOutcome_5 days 0.000545
5 AgeuponOutcome_0 years 0.000500
58 Breed_Maine Coon 0.000423
6 AgeuponOutcome_1 day 0.000404
32 AgeuponOutcome_6 days 0.000380
24 AgeuponOutcome_4 days 0.000330
54 Breed_Finnish 0.000319
30 AgeuponOutcome_5 weeks 0.000306
60 Breed_Manx 0.000299
62 Breed_Persian 0.000287
42 Breed_American Shorthair 0.000153
71 Breed_Swedish Vallhund 0.000105

150 rows × 2 columns


In [111]:
print(metrics.classification_report(y_test, model_rf.predict(X_test)))


             precision    recall  f1-score   support

   Adoption       0.68      0.74      0.71      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.35      0.20      0.26       298
Return_to_owner       0.46      0.42      0.44       961
   Transfer       0.68      0.69      0.69      1835

avg / total       0.62      0.63      0.62      5346


In [112]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, model_rf.predict(X_test))
print(cm)
cm = cm.astype(float)  # cast to float so the row normalization below is not integer division
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ RandomForest')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1643    1   21  275  279]
 [   3    0    3    1   26]
 [  43    1   60   62  132]
 [ 372    0   23  406  160]
 [ 354    0   63  147 1271]]
Out[112]:
<matplotlib.text.Text at 0x109860790>
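
The same confusion-matrix plot is repeated for every model below. A small helper could factor it out; a minimal sketch, assuming confusion_matrix, np, and plt are available as above (plot_cm is a hypothetical name):

def plot_cm(y_true, y_pred, title):
    # Compute the confusion matrix, row-normalize it to percentages, and draw it as a heatmap.
    cm = confusion_matrix(y_true, y_pred).astype(float)
    cm = cm / cm.sum(axis=1, keepdims=True) * 100
    plt.grid(False)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    labels = np.unique(y_true)          # sorted class labels
    ticks = np.arange(len(labels))
    plt.xticks(ticks, labels, rotation=45)
    plt.yticks(ticks, labels)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# e.g. plot_cm(y_test, model_rf.predict(X_test), 'Confusion matrix _ RandomForest')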

In [113]:
model_rf.predict_proba(X_test)


Out[113]:
array([[ 0.4       ,  0.        ,  0.23333333,  0.26666667,  0.1       ],
       [ 0.66944444,  0.        ,  0.        ,  0.        ,  0.33055556],
       [ 0.        ,  0.        ,  0.03333333,  0.        ,  0.96666667],
       ..., 
       [ 0.99444444,  0.        ,  0.        ,  0.        ,  0.00555556],
       [ 0.        ,  0.01666667,  0.        ,  0.        ,  0.98333333],
       [ 0.35      ,  0.        ,  0.        ,  0.55      ,  0.1       ]])
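
The column order of predict_proba follows the fitted model's classes_ attribute (the class names in alphabetical order), which is worth printing before interpreting the probabilities above:

print(model_rf.classes_)   # Adoption, Died, Euthanasia, Return_to_owner, Transfer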

Logistic Regression


In [114]:
from sklearn.linear_model import LogisticRegression

In [115]:
from sklearn import metrics

In [116]:
model_lr = LogisticRegression(C=1e5).fit(X_train, y_train)
print(metrics.classification_report(y_test, model_lr.predict(X_test)))


             precision    recall  f1-score   support

   Adoption       0.67      0.85      0.75      2219
       Died       1.00      0.03      0.06        33
 Euthanasia       0.52      0.11      0.18       298
Return_to_owner       0.49      0.41      0.45       961
   Transfer       0.73      0.66      0.69      1835

avg / total       0.65      0.66      0.64      5346


In [117]:
model_lr.score(X_test, y_test)


Out[117]:
0.65918443696221474

In [118]:
cm = confusion_matrix(y_test, model_lr.predict(X_test))
print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ Logistic Regression')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1890    0    3  193  133]
 [   2    1    4    0   26]
 [  42    0   33   72  151]
 [ 431    0    4  395  131]
 [ 469    0   20  141 1205]]
Out[118]:
<matplotlib.text.Text at 0x10b4d8350>

In [119]:
model_lr.predict_proba(X_test)


Out[119]:
array([[  2.19453771e-01,   1.57030382e-03,   2.34032869e-01,
          4.21208808e-01,   1.23734248e-01],
       [  8.52733288e-01,   2.08851912e-03,   1.67755384e-02,
          4.46858506e-03,   1.23934070e-01],
       [  3.02549631e-05,   1.21143190e-06,   1.22835006e-01,
          4.58867370e-03,   8.72544854e-01],
       ..., 
       [  8.26766837e-01,   6.51437676e-04,   3.75934604e-03,
          1.15783977e-02,   1.57243982e-01],
       [  2.84419127e-06,   1.85681636e-02,   1.30407412e-01,
          3.67643513e-03,   8.47345145e-01],
       [  5.27098331e-01,   1.79364805e-07,   1.44817698e-02,
          4.04053472e-01,   5.43662476e-02]])

In [121]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet

In [122]:
lasso1 = Lasso(alpha = 0).fit(X_train, y_train)


/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
  if __name__ == '__main__':
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-122-306dd26fd5fc> in <module>()
----> 1 lasso1 = Lasso(alpha = 0).fit(X_train, y_train)

/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.pyc in fit(self, X, y, check_input)
    654         # when bypassing checks
    655         if check_input:
--> 656             y = np.asarray(y, dtype=np.float64)
    657             X, y = check_X_y(X, y, accept_sparse='csc', dtype=np.float64,
    658                              order='F',

/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    472 
    473     """
--> 474     return array(a, dtype, copy=False, order=order)
    475 
    476 def asanyarray(a, dtype=None, order=None):

ValueError: could not convert string to float: Adoption
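
Lasso (like Ridge and ElasticNet) is a regression estimator, so it cannot be fit on string class labels; the target would first have to be encoded as numbers, e.g. with LabelEncoder (a hedged sketch; for classification, LogisticRegression above is the appropriate linear model anyway):

from sklearn.preprocessing import LabelEncoder
y_train_enc = LabelEncoder().fit_transform(y_train)   # 'Adoption' -> 0, 'Died' -> 1, ...
Lasso(alpha=0.1).fit(X_train, y_train_enc)            # runs, but regressing on class indices is not meaningful here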

In [123]:
Lasso?

In [ ]:

Kernel SVM


In [65]:
from sklearn.svm import SVC

In [66]:
model_svc = SVC(probability=True).fit(X_train, y_train)

In [68]:
print(metrics.classification_report(y_test, model_svc.predict(X_test)))


             precision    recall  f1-score   support

   Adoption       0.60      0.95      0.73      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.00      0.00      0.00       298
Return_to_owner       0.57      0.20      0.30       961
   Transfer       0.76      0.62      0.68      1835

avg / total       0.61      0.64      0.59      5346

/Users/eunseopjeoung/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

In [69]:
model_svc.score(X_test, y_test)


Out[69]:
0.64216236438458663

In [71]:
cm = confusion_matrix(y_test, model_svc.predict(X_test))
print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ SVM')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[2100    0    0   41   78]
 [   5    0    0    0   28]
 [ 100    0    0   33  165]
 [ 672    0    0  196   93]
 [ 625    0    0   73 1137]]
Out[71]:
<matplotlib.text.Text at 0x111ddb910>
Let's combine the three models into an ensemble.
1. Voting
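
With voting='soft', the ensemble averages the class probabilities of the estimators and predicts the argmax. Conceptually, for the three fitted models above (model_lr, model_rf, model_svc), a hedged sketch of what it computes, ignoring weights:

avg_proba = (model_lr.predict_proba(X_test) +
             model_rf.predict_proba(X_test) +
             model_svc.predict_proba(X_test)) / 3.0       # one row per sample, one column per class
manual_vote = model_lr.classes_[avg_proba.argmax(axis=1)]  # hypothetical variable names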

In [85]:
from sklearn.ensemble import VotingClassifier

In [84]:
clf1 = LogisticRegression(C=1e5, random_state=213)
clf2 = RandomForestClassifier(n_estimators=30, random_state=123)
clf3 = SVC(probability=True)

In [86]:
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

In [93]:
# This takes a while to fit
eclf.fit(X_train, y_train)


Out[93]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)

In [94]:
eclf.score(X_test, y_test)


Out[94]:
0.65544332210998879

In [95]:
predict_eclf = eclf.predict(X_test)

In [96]:
print(metrics.classification_report(y_test, predict_eclf))


             precision    recall  f1-score   support

   Adoption       0.66      0.84      0.74      2219
       Died       0.00      0.00      0.00        33
 Euthanasia       0.73      0.05      0.10       298
Return_to_owner       0.49      0.40      0.44       961
   Transfer       0.72      0.67      0.70      1835

avg / total       0.65      0.66      0.63      5346


In [99]:
cm = confusion_matrix(y_test, eclf.predict(X_test))

print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
outcomes = sorted(y_test.unique())
tick_marks = np.arange(len(set(list(y_test))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1870    0    1  201  147]
 [   4    0    2    0   27]
 [  52    0   16   69  161]
 [ 434    0    1  388  138]
 [ 462    0    2  141 1230]]
Out[99]:
<matplotlib.text.Text at 0x11353e610>

Submit to Kaggle.


In [105]:
eclf_kaggle = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

In [109]:
# X_dummy carries an OutcomeType column at this point (added via 'df2 = X_dummy' in the
# Died-excluded section, which aliases the same DataFrame), so drop it before fitting.
X_dummy = X_dummy.drop('OutcomeType', axis=1)

In [110]:
eclf_kaggle.fit(X_dummy, y)


Out[110]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)

In [112]:
df_test = pd.read_csv('test.csv')
df_test = df_test.drop(['ID'], axis=1)

# Name encoding
df_test.Name = df_test.Name.fillna(0)
df_test.loc[df_test.Name != 0, 'Name'] = 1

# AnimalType encoding: 1 for Dog, 0 for Cat
df_test.AnimalType = df_test.AnimalType.apply(lambda x: 1 if x=='Dog' else 0)

# AgeuponOutcome: fill missing values with '0 years' and collapse ages of 13 years or more
df_test.AgeuponOutcome = df_test.AgeuponOutcome.fillna('0 years')
df_test.AgeuponOutcome = df_test.AgeuponOutcome.apply(check_over13years)

# Breed
df_test.Breed = df_test.Breed.apply(check_in_breeds)

# Color
df_test.Color = df_test.Color.apply(check_in_colors)

# hour
df_test['hour'] = df_test.DateTime.apply(lambda x:x[11:13])
df_test.hour = df_test.hour.apply(check_hour)
df_test = df_test.drop('DateTime', axis=1)

In [113]:
X_test_dummy = pd.get_dummies(df_test.ix[:, ['SexuponOutcome', 'AgeuponOutcome', 
                                             'Breed', 'Color', 'hour']])

In [114]:
X_test_dummy['Name'] = df_test.Name
X_test_dummy['AnimalType'] = df_test.AnimalType

In [115]:
print(X_test_dummy.columns[130:150])
print(X_dummy.columns[130:150])


Index([u'Color_White/Tan', u'Color_White/Tricolor', u'Color_Yellow',
       u'hour_08', u'hour_09', u'hour_10', u'hour_11', u'hour_12', u'hour_13',
       u'hour_14', u'hour_15', u'hour_16', u'hour_17', u'hour_18', u'hour_19',
       u'hour_20_22', u'hour_23_0', u'hour_5_8', u'Name', u'AnimalType'],
      dtype='object')
Index([u'Color_White/Tan', u'Color_White/Tricolor', u'Color_Yellow',
       u'hour_08', u'hour_09', u'hour_10', u'hour_11', u'hour_12', u'hour_13',
       u'hour_14', u'hour_15', u'hour_16', u'hour_17', u'hour_18', u'hour_19',
       u'hour_20_22', u'hour_23_0', u'hour_5_8', u'Name', u'AnimalType'],
      dtype='object')

In [154]:
result_predict = eclf_kaggle.predict(X_test_dummy)

In [155]:
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(result_predict)+1)
count = 1
for predict_val in result_predict:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission2.csv')
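
If the competition is scored by multiclass log loss (which this one-column-per-outcome submission format suggests), submitting predict_proba values instead of hard 0/1 labels would likely score better. A hedged sketch (filename hypothetical):

proba = eclf_kaggle.predict_proba(X_test_dummy)            # columns follow eclf_kaggle.classes_
df_proba = pd.DataFrame(proba, columns=eclf_kaggle.classes_)
df_proba.insert(0, 'ID', range(1, len(df_proba) + 1))
df_proba.to_csv('submission2_proba.csv', index=False)      # hypothetical filename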

Ensemble via the product of predicted probabilities


In [116]:
# Fit Random Forest and Logistic Regression on the full training set, multiply their
# predicted class probabilities, and take the argmax of the product for each sample.
model_rf = RandomForestClassifier()
model_rf.fit(X_dummy, y)
model_lf = LogisticRegression()
model_lf.fit(X_dummy, y)
result_lf = model_lf.predict_proba(X_test_dummy)
result_rf = model_rf.predict_proba(X_test_dummy)
rs = result_rf * result_lf

# Map column indices back to class names (predict_proba columns are in the
# alphabetical order of the fitted classes).
result_dict = {0: 'Adoption', 1: 'Died', 2: 'Euthanasia',
               3: 'Return_to_owner', 4: 'Transfer'}
rs2 = []
for i in rs.argmax(axis=1):
    rs2.append(result_dict[i])
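
Equivalently, the hard-coded index-to-label dict can be avoided by indexing into the fitted model's classes_ array (a hedged alternative sketch):

rs2 = list(model_rf.classes_[rs.argmax(axis=1)])   # classes_ is in the same column order as predict_proba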

In [118]:
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(rs)+1)
count = 1
for predict_val in rs2:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission2.csv')  # note: this overwrites the voting-ensemble submission2.csv written above

Model that excludes Died.


In [72]:
# df2: the dataframe excluding samples whose OutcomeType is 'Died'.
# Note that 'df2 = X_dummy' does not copy, so adding OutcomeType here also adds it to
# X_dummy itself (which is why it was dropped again before the Kaggle fit above).
df2 = X_dummy
df2['OutcomeType'] = y

In [74]:
df2 = df2[df2.OutcomeType != 'Died']

In [76]:
X_dummy2 = df2.drop('OutcomeType', axis=1)

In [78]:
y2 = df2.OutcomeType

In [80]:
# train test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_dummy2, y2, test_size=0.20, random_state=42)

Random Forest


In [81]:
# using RandomForest
model_rf2 = RandomForestClassifier(n_estimators=30)
model_rf2.fit(X_train2, y_train2)
model_rf2.score(X_test2, y_test2)


Out[81]:
0.63576408517052951

In [83]:
print(metrics.classification_report(y_test2, model_rf2.predict(X_test2)))


             precision    recall  f1-score   support

   Adoption       0.68      0.75      0.71      2149
 Euthanasia       0.38      0.20      0.26       316
Return_to_owner       0.44      0.40      0.42       946
   Transfer       0.69      0.70      0.69      1896

avg / total       0.62      0.64      0.63      5307


In [90]:
cm = confusion_matrix(y_test2, model_rf2.predict(X_test2))
print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ randomforest')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1611   14  256  268]
 [  50   63   63  140]
 [ 359   25  379  183]
 [ 355   63  157 1321]]
Out[90]:
<matplotlib.text.Text at 0x10cac7e10>

Logistic Regression


In [87]:
model_lr2 = LogisticRegression(C=1e5).fit(X_train2, y_train2)
print(metrics.classification_report(y_test2, model_lr2.predict(X_test2)))


             precision    recall  f1-score   support

   Adoption       0.66      0.84      0.74      2149
 Euthanasia       0.63      0.11      0.19       316
Return_to_owner       0.48      0.42      0.45       946
   Transfer       0.75      0.67      0.71      1896

avg / total       0.66      0.66      0.64      5307


In [89]:
model_lr2.score(X_test, y_test)


Out[89]:
0.66086793864571647

In [91]:
cm = confusion_matrix(y_test2, model_lr2.predict(X_test2))
print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ logistic regression')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1803    0  211  135]
 [  46   36   64  170]
 [ 423    7  396  120]
 [ 460   14  147 1275]]
Out[91]:
<matplotlib.text.Text at 0x111db9810>

In [98]:
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

In [100]:
eclf2.fit(X_train2, y_train2)


Out[100]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)

In [101]:
eclf2.score(X_test2, y_test2)


Out[101]:
0.66968155266628981

In [102]:
predict_eclf2 = eclf2.predict(X_test2)

In [103]:
print(metrics.classification_report(y_test2, predict_eclf2))


             precision    recall  f1-score   support

   Adoption       0.66      0.86      0.75      2149
 Euthanasia       0.86      0.06      0.11       316
Return_to_owner       0.50      0.42      0.46       946
   Transfer       0.76      0.68      0.72      1896

avg / total       0.68      0.67      0.65      5307


In [104]:
cm = confusion_matrix(y_test2, predict_eclf2)
print(cm)
cm = cm.astype(float)  # avoid integer division when row-normalizing
for i in range(len(cm)):
    cm[i, :] = cm[i, :] / cm[i, :].sum() * 100
plt.grid(False)
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix _ ensemble')
plt.colorbar()
outcomes = sorted(y_test2.unique())
tick_marks = np.arange(len(set(list(y_test2))))
plt.xticks(tick_marks, outcomes, rotation=45)
plt.yticks(tick_marks, outcomes)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')


[[1841    0  189  119]
 [  65   19   64  168]
 [ 424    2  397  123]
 [ 452    1  146 1297]]
Out[104]:
<matplotlib.text.Text at 0x124a1c1d0>

Submit to Kaggle


In [156]:
eclf_kaggle2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)], voting='soft')

In [157]:
eclf_kaggle2.fit(X_dummy2, y2)


Out[157]:
VotingClassifier(estimators=[('lr', LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=213,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ('rf', Ran...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         voting='soft', weights=None)

In [158]:
result_predict2 = eclf_kaggle2.predict(X_test_dummy)

In [159]:
df_result = pd.DataFrame(columns=['ID','Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'])
df_result['ID'] = range(1, len(result_predict2)+1)
count = 1
for predict_val in result_predict2:
    
    df_result.loc[df_result.ID==count, predict_val] = 1
    count+=1
df_result = df_result.fillna(0)
df_result.index = df_result.ID
df_result = df_result.drop('ID', axis=1)
df_result.to_csv('submission3.csv')
